Game Sales Analysis Project¶
This project explores data on game sales, the platforms on which games are released, and their maturity ratings. We will examine differences in regional sales, compare sales across platforms and how platforms fare over time, and check whether certain regions or platform user bases show a marked preference for particular genres.
Hypothesis Testing¶
Over the course of the project, we will also test the following hypotheses:
—Average user ratings of the Xbox One and PC platforms are the same.
—Average user ratings for the Action and Sports genres are different.
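Both hypotheses can be checked with a two-sample t-test. As a preview, here is a minimal sketch on synthetic stand-in data (the real test will use the user score column filtered by platform or genre; the numbers below are made up):

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-ins for two groups of user scores (hypothetical values)
rng = np.random.default_rng(42)
xone_scores = rng.normal(6.5, 1.4, 200)  # stand-in for Xbox One user scores
pc_scores = rng.normal(6.5, 1.5, 250)    # stand-in for PC user scores

# equal_var=False gives Welch's t-test, which does not assume
# equal variances between the two groups
result = st.ttest_ind(xone_scores, pc_scores, equal_var=False)
alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: the mean user scores differ")
else:
    print("Fail to reject H0: no evidence the means differ")
```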
Initialization¶
We begin by importing the libraries we need for this project.
import pandas as pd
import numpy as np
from scipy import stats as st
import seaborn as sns
from matplotlib import pyplot as plt
import plotly as py
import plotly.express as px
Load Data¶
Next we load the data. The values are comma-separated, so the default read_csv settings should work.
#read in the games file
games = pd.read_csv('https://www.dropbox.com/scl/fi/1gzcrxybsr0z37jei3tf9/games.csv?rlkey=99gltbu4vt4o564y72jcj0pa0&st=daf65zj8&raw=1')
display(games.info())
display(games.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Name             16713 non-null  object
 1   Platform         16715 non-null  object
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object
 10  Rating           9949 non-null   object
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
None
| Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8 | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8 | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN |
The data set has 16715 rows distributed across 11 columns. The column names are in snake case but with inconsistent capitalization. Six columns contain missing values, and in three of them (critic score, user score and rating) 40-50% of the data is missing. Year_of_Release should be an integer and User_Score should be numeric rather than object. We will make these conversions, normalize the column names to lower case, and identify a strategy for handling the missing values.
Prepare the Data¶
#convert all column headers to lower case
games.columns = games.columns.str.lower()
#convert game names, platform names and genres to lower case for easier comparison
games['name'] = games['name'].str.lower()
games['platform'] = games['platform'].str.lower()
games['genre'] = games['genre'].str.lower()
#examine the unique values in each column and their frequency
print(games['rating'].value_counts(dropna=False))
print(games['name'].value_counts(dropna=False))
print(games['platform'].value_counts(dropna=False))
print(games['genre'].value_counts(dropna=False))
print(games['user_score'].value_counts(dropna=False))
print(games['critic_score'].value_counts(dropna=False))
print(games['na_sales'].value_counts(dropna=False))
print(games['eu_sales'].value_counts(dropna=False))
print(games['jp_sales'].value_counts(dropna=False))
print(games['other_sales'].value_counts(dropna=False))
rating
NaN 6766
E 3990
T 2961
M 1563
E10+ 1420
EC 8
K-A 3
RP 3
AO 1
Name: count, dtype: int64
name
need for speed: most wanted 12
madden nfl 07 9
fifa 14 9
ratatouille 9
lego marvel super heroes 9
..
jewels of the tropical lost island 1
sherlock holmes and the mystery of osborne house 1
the king of fighters '95 (cd) 1
megamind: mega team unite 1
haitaka no psychedelica 1
Name: count, Length: 11560, dtype: int64
platform
ps2 2161
ds 2151
ps3 1331
wii 1320
x360 1262
psp 1209
ps 1197
pc 974
xb 824
gba 822
gc 556
3ds 520
psv 430
ps4 392
n64 319
xone 247
snes 239
sat 173
wiiu 147
2600 133
nes 98
gb 98
dc 52
gen 29
ng 12
scd 6
ws 6
3do 3
tg16 2
gg 1
pcfx 1
Name: count, dtype: int64
genre
action 3369
sports 2348
misc 1750
role-playing 1498
shooter 1323
adventure 1303
racing 1249
platform 888
simulation 873
fighting 849
strategy 683
puzzle 580
NaN 2
Name: count, dtype: int64
user_score
NaN 6701
tbd 2424
7.8 324
8 290
8.2 282
...
1.1 2
1.9 2
9.6 2
0 1
9.7 1
Name: count, Length: 97, dtype: int64
critic_score
NaN 8578
70.0 256
71.0 254
75.0 245
78.0 240
...
20.0 3
21.0 1
17.0 1
22.0 1
13.0 1
Name: count, Length: 83, dtype: int64
na_sales
0.00 4508
0.02 576
0.03 561
0.04 549
0.01 545
...
3.68 1
2.98 1
2.61 1
3.04 1
41.36 1
Name: count, Length: 402, dtype: int64
eu_sales
0.00 5870
0.01 1494
0.02 1308
0.03 926
0.04 709
...
1.87 1
1.75 1
1.44 1
1.60 1
28.96 1
Name: count, Length: 307, dtype: int64
jp_sales
0.00 10514
0.02 741
0.01 706
0.03 546
0.04 397
...
3.14 1
2.73 1
3.18 1
3.63 1
0.97 1
Name: count, Length: 244, dtype: int64
other_sales
0.00 6601
0.01 3464
0.02 1618
0.03 926
0.04 654
...
2.11 1
7.53 1
1.21 1
1.79 1
0.93 1
Name: count, Length: 155, dtype: int64
#year_of_release has many missing values
#find the lowest year in the data, then fill missing years with an earlier sentinel value
#so the column can be converted to integer; rows carrying the sentinel can be filtered out later
print(games['year_of_release'].min())
games['year_of_release'] = games['year_of_release'].fillna(1975)
games['year_of_release'] = games['year_of_release'].astype(int)
#convert user_score from object to float for easy calculations
#the column also contains many 'tbd' (to be determined) entries
#these become NaN along with any other non-numeric values via errors='coerce'
games['user_score'] = games['user_score'].str.replace('tbd', 'nan')
games['user_score'] = games['user_score'].apply(pd.to_numeric, errors='coerce')
#check that the replacement and conversion worked
print(games['user_score'].value_counts(dropna=False))
print(games['user_score'].min())
games.info()
display(games.head())
1980.0
user_score
NaN 9125
7.8 324
8.0 290
8.2 282
8.3 254
...
1.5 2
0.3 2
1.1 2
0.0 1
9.7 1
Name: count, Length: 96, dtype: int64
0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 16713 non-null object
1 platform 16715 non-null object
2 year_of_release 16715 non-null int32
3 genre 16713 non-null object
4 na_sales 16715 non-null float64
5 eu_sales 16715 non-null float64
6 jp_sales 16715 non-null float64
7 other_sales 16715 non-null float64
8 critic_score 8137 non-null float64
9 user_score 7590 non-null float64
10 rating 9949 non-null object
dtypes: float64(6), int32(1), object(4)
memory usage: 1.3+ MB
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wii sports | wii | 2006 | sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.0 | E |
| 1 | super mario bros. | nes | 1985 | platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN |
| 2 | mario kart wii | wii | 2008 | racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
| 3 | wii sports resort | wii | 2009 | sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.0 | E |
| 4 | pokemon red/pokemon blue | gb | 1996 | role-playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN |
NaN values form a substantial proportion of the user_score, critic_score and rating columns. The user_score column additionally contains 'tbd' (to be determined) entries, which had to be replaced before the column could be converted to float for downstream calculations. A 'tbd' most likely means the score has simply not yet been entered, so for our purposes it is not qualitatively different from NaN, and we converted these entries to NaN as well during the numeric conversion. If it later matters that 'tbd' marks games whose averages are pending rather than absent, the two categories could instead be kept separate by assigning 'tbd' a sentinel value outside the 1-10 user-score scale. Similarly, to keep year values as integers for plotting, we filled unknown years with 1975, which lies below the earliest release year in the data and can easily be filtered out.
#check for duplicates
print(games.duplicated().sum())
#checking for less obvious duplicates
games.duplicated(subset=['name','platform','year_of_release']).sum()
#looking at the duplicates generated as a dataframe
games_duplicate = games[games.duplicated(subset=['name','platform','year_of_release'],keep=False)]
display(games_duplicate)
#filtering out the duplicates
games = games.drop([ 659, 14244, 16230])
games.reset_index(drop=True, inplace=True)
#checking that the filters worked properly
games.info()
0
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 604 | madden nfl 13 | ps3 | 2012 | sports | 2.11 | 0.22 | 0.00 | 0.23 | 83.0 | 5.5 | E |
| 659 | NaN | gen | 1993 | NaN | 1.78 | 0.53 | 0.00 | 0.08 | NaN | NaN | NaN |
| 14244 | NaN | gen | 1993 | NaN | 0.00 | 0.00 | 0.03 | 0.00 | NaN | NaN | NaN |
| 16230 | madden nfl 13 | ps3 | 2012 | sports | 0.00 | 0.01 | 0.00 | 0.00 | 83.0 | 5.5 | E |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16712 entries, 0 to 16711
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   name             16712 non-null  object
 1   platform         16712 non-null  object
 2   year_of_release  16712 non-null  int32
 3   genre            16712 non-null  object
 4   na_sales         16712 non-null  float64
 5   eu_sales         16712 non-null  float64
 6   jp_sales         16712 non-null  float64
 7   other_sales      16712 non-null  float64
 8   critic_score     8136 non-null   float64
 9   user_score       7589 non-null   float64
 10  rating           9948 non-null   object
dtypes: float64(6), int32(1), object(4)
memory usage: 1.3+ MB
The dataframe had no fully duplicated rows, but when we checked for games sharing the same name, platform and release year, we identified two games that appeared twice. One has no name or any other identifying feature useful for analysis, so we dropped both of its rows entirely. The other appears twice with one near-empty entry (missing values in columns such as sales); we retained the more complete entry and dropped the other.
#dealing with NaN values
#let us calculate medians and means to understand the data distribution
print(games['critic_score'].mean())
print(games['critic_score'].median())
#tbd values became NaN during conversion, so no sentinel filtering is needed; mean/median skip NaN automatically
games2 = games
print(games2['user_score'].mean())
print(games2['user_score'].median())
68.96595378564405
71.0
7.12526024509158
7.5
The mean is pulled below the median in both cases, indicating the distributions are not normal: most scores cluster at higher values with a tail of low scores. We have decided not to substitute the null values, since substitution can distort the distribution and lead to incorrect interpretations or predictions down the line. At a later date we could attempt to fill in the values using machine-learning imputation. Dropping the rows with missing values is not an option either, since they comprise at least a third of each of these columns; too much data would be lost. Instead, we will ignore the missing values in any calculations involving these columns. We will also convert user and critic scores to the same scale to allow comparison later.
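If we do revisit imputation later, a simple starting point would be a per-genre median fill rather than a global one. A minimal sketch on a toy frame (illustrative only; we do not apply this to the data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real games data (hypothetical values)
df = pd.DataFrame({
    'genre': ['action', 'action', 'sports', 'sports'],
    'critic_score': [70.0, np.nan, 80.0, np.nan],
})

# Fill each missing critic_score with the median of its genre;
# transform preserves the original row alignment
df['critic_score'] = (df.groupby('genre')['critic_score']
                        .transform(lambda s: s.fillna(s.median())))
print(df['critic_score'].tolist())  # → [70.0, 70.0, 80.0, 80.0]
```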
#we will keep the missing values in user and critic scores as is and ignore them when making calculations to avoid distortion of the results
print(games['user_score'].value_counts(dropna=False))
print(games['critic_score'].value_counts(dropna=False))
#convert critic_score to the same scale as user_score to allow for comparison later
#we will be converting from a 100 point scale to a 10 point scale
games['critic_score'] = games['critic_score']/10
print(games['critic_score'].value_counts(dropna=False))
user_score
NaN 9123
7.8 324
8.0 290
8.2 282
8.3 254
...
1.5 2
0.3 2
1.1 2
0.0 1
9.7 1
Name: count, Length: 96, dtype: int64
critic_score
NaN 8576
70.0 256
71.0 254
75.0 245
78.0 240
...
20.0 3
21.0 1
17.0 1
22.0 1
13.0 1
Name: count, Length: 83, dtype: int64
critic_score
NaN 8576
7.0 256
7.1 254
7.5 245
7.8 240
...
2.0 3
2.1 1
1.7 1
2.2 1
1.3 1
Name: count, Length: 83, dtype: int64
#calculate total sales column
games['total_sales'] = games['na_sales'] + games['eu_sales'] + games['jp_sales'] + games['other_sales']
display(games.head())
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wii sports | wii | 2006 | sports | 41.36 | 28.96 | 3.77 | 8.45 | 7.6 | 8.0 | E | 82.54 |
| 1 | super mario bros. | nes | 1985 | platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN | 40.24 |
| 2 | mario kart wii | wii | 2008 | racing | 15.68 | 12.76 | 3.79 | 3.29 | 8.2 | 8.3 | E | 35.52 |
| 3 | wii sports resort | wii | 2009 | sports | 15.61 | 10.93 | 3.28 | 2.95 | 8.0 | 8.0 | E | 32.77 |
| 4 | pokemon red/pokemon blue | gb | 1996 | role-playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN | 31.38 |
We have analyzed the data, converted the column names to lower-case snake case, converted all string columns to lower case for easier comparison, and converted object-typed numeric columns to float or integer.
We have identified duplicates and rows with missing identifying data and removed them. We have also created a separate column for global sales that captures the total sales across all regions.
We have converted user and critic scores to the same 10-point scale for easier comparison, and filled missing year-of-release values with 1975 so that they can be filtered out easily (the earliest release year in this data set is 1980).
User scores marked tbd were converted to NaN along with the other missing values during the numeric conversion, since both represent unknown scores.
We have decided not to fill in the missing user and critic scores with the median or mean, since this could distort the distribution and lead to incorrect conclusions or predictions later. We will ignore these values in calculations involving the columns. Dropping the corresponding rows is not an option since they comprise about a third of the entries in each of these columns; too much data would be lost.
#excluding from data set the rows where year of release is not known, we then group dataset by years to see game release per year
games_ex_na = games[games['year_of_release'] != 1975]
games_yearly = games_ex_na.groupby('year_of_release')['name'].count().reset_index()
games_yearly = games_yearly.rename(columns={'name':'number_of_games_sold'}, level=0)
display(games_yearly.head())
| year_of_release | number_of_games_sold | |
|---|---|---|
| 0 | 1980 | 9 |
| 1 | 1981 | 46 |
| 2 | 1982 | 36 |
| 3 | 1983 | 17 |
| 4 | 1984 | 14 |
sns.set_style('whitegrid')
plt.figure(figsize=(18,5))
sns.barplot(data=games_yearly, x='year_of_release', y='number_of_games_sold', hue='year_of_release', palette='rainbow', legend=False)
plt.title('Games Released per Year, 1980-2016')
plt.ylabel('Number of Games')
plt.xlabel('Year of Release')
plt.show()
There is a steady increase in the number of games released yearly until 2009, after which we see a gradual decline followed by a steep drop in 2012 to roughly half the previous year's volume. The numbers have remained fairly steady since then.
Yearly game sales broken down by platform: emerging trends¶
#group by year of release and platform to see the yearly total sales for each platform
games_yr_platform = games_ex_na.groupby(['year_of_release','platform'])['total_sales'].sum().reset_index()
#display top rows of resulting dataframe
display(games_yr_platform.head())
#excluding platforms with sales less than 50 million USD
games_ex_smallp = games_yr_platform[games_yr_platform['total_sales'] >= 50]
#filtering data for different time periods, post 95, post 2000 and post 2005
games_95_a = games_ex_smallp[games_ex_smallp['year_of_release'] >= 1995]
games_2k_a = games_ex_smallp[games_ex_smallp['year_of_release'] >= 2000]
games_2k5_a = games_ex_smallp[games_ex_smallp['year_of_release'] >= 2005]
#we look at the distribution of sales for different platforms for the different time periods
games_sales_distribution_a = px.bar(games_ex_smallp,
x ='year_of_release',
y = 'total_sales',
color = 'platform',
text = 'platform',
title='Games Yearly Sales Classified by Platform 1980-2016',
labels = dict(year_of_release = 'Year', total_sales = 'Total Sales'),
height=600,
width=1000
)
games_sales_distribution_a.show()
games_sales_dist_2k5_a = px.bar(games_2k5_a,
x ='year_of_release',
y = 'total_sales',
color = 'platform',
text = 'platform',
title='Games Yearly Sales Classified by Platform 2005-2016',
labels = dict(year_of_release = 'Year', total_sales = 'Total Sales'),
height=600,
width=1000
)
games_sales_dist_2k5_a.show()
#we filter the original grouped dataframe without excluding small players for releases after 2010
#this is to examine if any small players have started gaining prominence in the last 10 years
games_2k10 = games_yr_platform[games_yr_platform['year_of_release'] >= 2010]
#we look at the distribution of sales for different platforms for the different time periods
games_sales_dist_2k10 = px.bar(games_2k10,
x ='year_of_release',
y = 'total_sales',
color = 'platform',
text = 'platform',
title='Games Yearly Sales Classified by Platform 2010-2016 including smaller platforms',
labels = dict(year_of_release = 'Year', total_sales = 'Total Sales'),
height=600,
width=1000
)
games_sales_dist_2k10.show()
| year_of_release | platform | total_sales | |
|---|---|---|---|
| 0 | 1980 | 2600 | 11.38 |
| 1 | 1981 | 2600 | 35.68 |
| 2 | 1982 | 2600 | 28.88 |
| 3 | 1983 | 2600 | 5.84 |
| 4 | 1983 | nes | 10.96 |
We looked at the total sales per year across platforms from 1980-2016. The full data was difficult to interpret because of the many small platforms contributing little to sales, so we restricted the view to platforms generating at least 50 million USD of revenue in a given year.
With this cut-off, when we observe the trends, we see that a particular platform generates revenue for roughly 5 years on average before it ceases to be relevant and is replaced by newer platforms or newer models.
We can see, for instance, that the NES was a popular platform in the 1980s and no longer generates sales. Similarly, the DS, a dominant platform in the mid-2000s, no longer generates much revenue, and the Nintendo Wii, a top earner between 2005 and 2010, has likewise stopped selling.
When it comes to the Xboxes and PlayStations, each is usually replaced by the next model in the series. We can see this clearly when we zoom in on the years from the early 2000s onward.
Looking at 2005-2016, it becomes clear that the older platforms have been largely replaced and that the market is now dominated by the PS4 and Xbox One; we should probably focus our advertising campaigns on these two platforms and their users. Since 2015 the PS4 has been the clear leader in platform sales. All other platforms have steadily lost market share, and no new emerging players are visible even when we include every platform, irrespective of revenue, for the period after 2010. For a 2017 campaign, we should therefore focus on sales from 2014 onwards, since the earlier top sellers have largely stopped generating revenue or have been replaced by updated hardware.
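The rough five-year platform life cycle described above can be checked numerically by computing, per platform, the span between its first and last year with recorded sales. A sketch on toy data (the column names match our cleaned data; the sales figures are made up):

```python
import pandas as pd

# Toy yearly platform sales (hypothetical numbers)
sales = pd.DataFrame({
    'platform': ['wii', 'wii', 'wii', 'ps4', 'ps4'],
    'year_of_release': [2006, 2008, 2011, 2013, 2016],
    'total_sales': [137.9, 171.1, 58.6, 25.0, 69.2],
})

# Lifespan = last active year minus first active year, per platform
lifespan = (sales.groupby('platform')['year_of_release']
                 .agg(first='min', last='max'))
lifespan['years_active'] = lifespan['last'] - lifespan['first']
print(lifespan)
```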
#prioritize the data for analysis, zooming in on the period between 2014 and 2016
games_prio = games[games['year_of_release'] >= 2014]  #no sentinel user scores remain after the tbd-to-NaN conversion
#filter grouped sales revenue datframe also to reflect the relevant time period
games_2k14 = games_yr_platform[games_yr_platform['year_of_release'] >= 2014]
# Plot a boxplot to visualize the distribution of sales by platform
plat_sales_fig = px.box(games_2k14, x='platform', y='total_sales', color='platform', title='Global Sales by Platform 2014-2016')
plat_sales_fig.update_traces(boxmean=True)
plat_sales_fig.show()
#excluding bit players with a threshold of 20 million USD, i.e. less than a fifth of the revenue of the current lead platform, PS4
games_2k14_ex = games_2k14[games_2k14['total_sales'] >= 20]
# Plot a boxplot to visualize the distribution of sales by platform
plat_sales_fig_b = px.box(games_2k14_ex, x='platform', y='total_sales', color='platform', title='Global Sales by Platform 2014-2016, excluding platforms generating less than 20 million USD')
plat_sales_fig_b.update_traces(boxmean=True)
plat_sales_fig_b.show()
The two plots above show sales by platform for 2014-2016, the period we consider relevant.
There is a significant difference in the revenue generated by the different platforms: even between the top two, average revenue differs by roughly 50 million USD. The distributions also differ markedly, with the PS4 generating much more in sales than its closest competitors, the Xbox One and PS3. The PS4's lead over the PS3 is to be expected: newer, more exciting games tend to be released on the newer PS4, drawing users to that platform, which in turn attracts more releases in a consistent feed-forward loop.
Dependence, if any, of sales on user or critic scores¶
We will examine the dependence of sales on user and critic scores on the most popular platform overall (ps4) and then separately for the second most popular platform (xone). We will also calculate the correlation coefficient for both to understand the exact relationship between sales and scores for each platform.
#we want to look at how sales varies with user and critic scores for ps4 platform
#plotly automatically ignores the NaN values in user and critic scores when generating the plot
games_prio_ps4 = games_prio[games_prio['platform'] == 'ps4']
sales_scatter = px.scatter(games_prio_ps4,
x='user_score',
y='total_sales',
title='Global sales vs user score for PS4',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color_discrete_sequence= px.colors.qualitative.Light24,
height=600,
width=800
)
#display scatter plot
sales_scatter.show()
sales_scatter_critic = px.scatter(games_prio_ps4,
x='critic_score',
y='total_sales',
title='Global sales vs critic score for PS4',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color_discrete_sequence= px.colors.qualitative.Light24,
height=600,
width=800
)
#display scatter plot
sales_scatter_critic.show()
sales_userscore = games_prio_ps4['user_score'].corr(games_prio_ps4['total_sales'])
sales_criticscore = games_prio_ps4['critic_score'].corr(games_prio_ps4['total_sales'])
print(f"The correlation of total_sales with user_score is {sales_userscore:0.2f}")
print(f"The correlation of total_sales with critic_score is {sales_criticscore:0.2f}")
The correlation of total_sales with user_score is -0.04
The correlation of total_sales with critic_score is 0.40
There does not appear to be any dependence of game sales on user scores, which the correlation coefficient confirms.
There is a modest correlation between sales and critic scores, but with a Pearson coefficient of only 0.4 it is hard to call the relationship strong; above roughly 0.7 we would have some confidence that critic scores drive sales. That is not the case here.
It could be different on other platforms, so let us check the values for the Xbox One (xone).
#we want to look at how sales varies with user and critic scores for the xone platform
games_prio_xone = games_prio[games_prio['platform'] == 'xone']
sales_scatter_x = px.scatter(games_prio_xone,
x='user_score',
y='total_sales',
title='Global sales vs user score for Xbox one',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color_discrete_sequence= px.colors.qualitative.Light24,
height=600,
width=800
)
#display scatter plot
sales_scatter_x.show()
sales_scatter_xcritic = px.scatter(games_prio_xone,
x='critic_score',
y='total_sales',
title='Global sales vs critic score for Xbox one',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color_discrete_sequence= px.colors.qualitative.Light24,
height=600,
width=800
)
#display scatter plot
sales_scatter_xcritic.show()
sales_x_userscore = games_prio_xone['user_score'].corr(games_prio_xone['total_sales'])
sales_x_criticscore = games_prio_xone['critic_score'].corr(games_prio_xone['total_sales'])
print(f"The correlation of total_sales with user_score is {sales_x_userscore:0.2f}")
print(f"The correlation of total_sales with critic_score is {sales_x_criticscore:0.2f}")
The correlation of total_sales with user_score is -0.07
The correlation of total_sales with critic_score is 0.43
We see the same pattern repeated with the xone games: no meaningful correlation between game sales and user scores, and only a minor correlation with critic scores, as observed for the PS4, which is unlikely to be significant.
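So far we have judged the correlations by magnitude alone. scipy's pearsonr also returns a p-value for the null hypothesis of zero correlation, which would let us attach a significance statement to these coefficients. A sketch on synthetic data (on the real columns we would need to drop NaN rows first, since pearsonr does not accept them):

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-ins for critic scores and sales with a weak
# positive relationship (hypothetical values)
rng = np.random.default_rng(0)
critic = rng.uniform(1, 10, 150)
sales = 0.3 * critic + rng.normal(0, 2, 150)

# Coefficient plus a p-value for the null of zero correlation
r, p = st.pearsonr(critic, sales)
print(f"r = {r:0.2f}, p = {p:0.4f}")
```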
Comparison of the sales of the same game on the top two platforms¶
# Filter games on both platforms (Xbox One and PS4)
xone_games = games_prio_xone['name']
ps4_games = games_prio_ps4['name']
# Filter the games that are on both platforms
games_psx = games_prio[(games_prio['name'].isin(xone_games)) & (games_prio['name'].isin(ps4_games))]
# Now, we will create two entries per game (one for each platform)
# Create a new DataFrame that combines both platforms
games_psx_xone = games_psx[games_psx['platform'] == 'xone']
games_psx_ps4 = games_psx[games_psx['platform'] == 'ps4']
# Combine both platform datasets into one
games_psx_combined = pd.concat([games_psx_xone, games_psx_ps4])
# Now you have two entries for each game, one for each platform
display(games_psx_combined.head())
#games on both platforms user score dependence
sales_scattpsx = px.scatter(games_psx_combined,
x='user_score',
y='total_sales',
title='Global sales vs user score of games released on PS4 and Xbox One',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color='platform',
height=600,
width=800
)
#display scatter plot
sales_scattpsx.show()
#games on both platforms critic score dependence
sales_scattpsx_critic = px.scatter(games_psx_combined,
x='critic_score',
y='total_sales',
title='Global sales vs critic score of games released on PS4 and Xbox One',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color='platform',
height=600,
width=800
)
#display scatter plot
sales_scattpsx_critic.show()
#paired scatter plot for games that are on both platforms showing the scatter between the games and sales on both platforms
sales_scattpsx_overall = px.scatter(games_psx_combined,
x='name',
y='total_sales',
title='Paired Comparison of Global Sales for games released on PS4 and Xbox One',
hover_name = 'name',
labels = dict(year_of_release = 'Year', total_sales = 'Global sales', platform = 'Platform'),
color='platform',
height=800,
width=1000
)
sales_scattpsx_overall.update_layout(
xaxis=dict(
tickangle=60,
tickmode='array'
)
)
#display scatter plot
sales_scattpsx_overall.show()
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99 | call of duty: black ops 3 | xone | 2015 | shooter | 4.59 | 2.11 | 0.01 | 0.68 | NaN | NaN | NaN | 7.39 |
| 165 | grand theft auto v | xone | 2014 | action | 2.81 | 2.19 | 0.00 | 0.47 | 9.7 | 7.9 | M | 5.47 |
| 179 | call of duty: advanced warfare | xone | 2014 | shooter | 3.22 | 1.55 | 0.01 | 0.48 | 8.1 | 5.4 | M | 5.26 |
| 270 | fallout 4 | xone | 2015 | role-playing | 2.51 | 1.32 | 0.01 | 0.38 | 8.8 | 6.2 | M | 4.22 |
| 342 | star wars battlefront (2015) | xone | 2015 | shooter | 2.04 | 1.28 | 0.02 | 0.32 | NaN | NaN | NaN | 3.66 |
When we compare sales on the two platforms side by side against critic or user score, most games appear to sell better on PS4 than on Xbox One. To confirm this, we looked at a paired comparison of global sales for each game on both platforms, and the PS4 version fairly consistently outsells the Xbox One version.
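The paired comparison can also be summarized in a single number: pivot the shared titles to one row per game and count how often the PS4 version outsells the Xbox One version. A sketch on toy data (hypothetical titles and sales):

```python
import pandas as pd

# Toy frame: the same titles on both platforms (hypothetical sales)
df = pd.DataFrame({
    'name': ['game a', 'game a', 'game b', 'game b', 'game c', 'game c'],
    'platform': ['ps4', 'xone', 'ps4', 'xone', 'ps4', 'xone'],
    'total_sales': [5.4, 3.1, 1.2, 1.5, 2.8, 2.0],
})

# One row per title, one sales column per platform
wide = df.pivot(index='name', columns='platform', values='total_sales')
share = (wide['ps4'] > wide['xone']).mean()
print(f"PS4 outsells Xbox One on {share:.0%} of shared titles")
```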
#look at the average sales for each genre by year
games_prio_genre = games_prio.groupby(['year_of_release','genre'])['total_sales'].mean().reset_index()
games_prio_genre = games_prio_genre.rename(columns={'total_sales':'average_global_sales'}, level=0)
#check that the grouping worked correctly
display(games_prio_genre.head())
#generate a sales distribution by genre for every year of interest
games_sales_distr_genre = px.bar(games_prio_genre,
x ='year_of_release',
y = 'average_global_sales',
color = 'genre',
text = 'genre',
title='Games Yearly Sales Classified by Genre 2014-2016',
labels = dict(year_of_release = 'Year', average_global_sales = 'Average Global Sales'),
height=1000,
width=600
)
games_sales_distr_genre.update_layout(
xaxis=dict(
dtick=1,
)
)
games_sales_distr_genre.show()
#boxplot to view the different genres and their average sales
genre_sales_fig = px.box(games_prio_genre, x='genre', y='average_global_sales', color='genre', title='Global Sales by Genre 2014-2016')
genre_sales_fig.update_traces(boxmean=True)
genre_sales_fig.show()
| year_of_release | genre | average_global_sales | |
|---|---|---|---|
| 0 | 2014 | action | 0.517181 |
| 1 | 2014 | adventure | 0.074267 |
| 2 | 2014 | fighting | 0.689130 |
| 3 | 2014 | misc | 0.556667 |
| 4 | 2014 | platform | 0.881000 |
Shooter games sell best, closely followed by sports games. Further down the list come other action-oriented genres such as racing, fighting and action. The genres generating the least sales are strategy and puzzle games. The common theme appears to be that the more active the game, the better it sells; slower, more contemplative games generate less revenue. Simulation and platform games were big sellers in 2014 but appear to be losing steam; whether the appetite remained while the hardware failed to meet expectations, or other factors drove the change, is unclear, and it may be an area with potential for growth. Role-playing games are mid-range sellers but sell consistently, with no large swings over the years. Miscellaneous games show a huge spread in sales over the years; it would be interesting to break this group down further to see where its sales come from, and whether particular producers or studios drive most of the growth.
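The "huge spread" observation can be quantified by computing, per genre, the spread of the yearly average sales, e.g. standard deviation and interquartile range. A sketch on toy numbers (the column names match the grouped frame above; the values are made up):

```python
import pandas as pd

# Toy yearly average sales per genre (hypothetical numbers)
df = pd.DataFrame({
    'genre': ['misc', 'misc', 'misc',
              'role-playing', 'role-playing', 'role-playing'],
    'average_global_sales': [0.1, 0.9, 2.0, 0.50, 0.55, 0.60],
})

# Spread of the yearly averages for each genre
spread = df.groupby('genre')['average_global_sales'].agg(
    std='std',
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
)
print(spread)
```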
#generate dataframes that group by platform and sum regional sales
games_prio_pna = games_prio.groupby(['platform'])['na_sales'].sum().reset_index()
games_prio_peu = games_prio.groupby(['platform'])['eu_sales'].sum().reset_index()
games_prio_pjp = games_prio.groupby(['platform'])['jp_sales'].sum().reset_index()
#generate dataframes that group by genre and sum regional sales
games_prio_gna = games_prio.groupby(['genre'])['na_sales'].sum().reset_index()
games_prio_geu = games_prio.groupby(['genre'])['eu_sales'].sum().reset_index()
games_prio_gjp = games_prio.groupby(['genre'])['jp_sales'].sum().reset_index()
#generate dataframes that group by rating and sum regional sales
games_prio_rna = games_prio.groupby(['rating'])['na_sales'].sum().reset_index()
games_prio_reu = games_prio.groupby(['rating'])['eu_sales'].sum().reset_index()
games_prio_rjp = games_prio.groupby(['rating'])['jp_sales'].sum().reset_index()
#sort platforms in descending order for each region
plat_na = games_prio_pna.sort_values(by='na_sales', ascending=False).reset_index(drop=True)
plat_eu = games_prio_peu.sort_values(by='eu_sales', ascending=False).reset_index(drop=True)
plat_jp = games_prio_pjp.sort_values(by='jp_sales', ascending=False).reset_index(drop=True)
#sort genres in descending order for each region
g_na = games_prio_gna.sort_values(by='na_sales', ascending=False).reset_index(drop=True)
g_eu = games_prio_geu.sort_values(by='eu_sales', ascending=False).reset_index(drop=True)
g_jp = games_prio_gjp.sort_values(by='jp_sales', ascending=False).reset_index(drop=True)
#sort ESRB ratings in descending order for each region
r_na = games_prio_rna.sort_values(by='na_sales', ascending=False).reset_index(drop=True)
r_eu = games_prio_reu.sort_values(by='eu_sales', ascending=False).reset_index(drop=True)
r_jp = games_prio_rjp.sort_values(by='jp_sales', ascending=False).reset_index(drop=True)
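The nine near-identical groupby/sort frames above can also be generated in one pass with a nested comprehension, which avoids the copy-paste and makes it easy to add another region or dimension later. A sketch on a toy frame with the same columns (the real code would use `games_prio` directly):

```python
import pandas as pd

# Toy stand-in for games_prio with the same regional-sales columns.
games_prio = pd.DataFrame({
    'platform': ['ps4', 'ps4', 'xone'],
    'genre': ['action', 'sports', 'action'],
    'rating': ['m', 'e', 'm'],
    'na_sales': [1.0, 2.0, 4.0],
    'eu_sales': [0.5, 0.2, 0.1],
    'jp_sales': [0.1, 0.0, 0.0],
})

regions = ['na_sales', 'eu_sales', 'jp_sales']
dims = ['platform', 'genre', 'rating']

# ranked[dim][region] replaces the nine hand-written frames:
# e.g. ranked['platform']['na_sales'] corresponds to plat_na.
ranked = {
    dim: {
        region: (games_prio.groupby(dim)[region].sum()
                 .sort_values(ascending=False)
                 .reset_index())
        for region in regions
    }
    for dim in dims
}

print(ranked['platform']['na_sales'])
```

The plotting loops further down could then iterate over the same `dims` and `regions` lists instead of repeating each figure by hand.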
#user profile for each region
#bar plots
#user profile for each region in terms of platform market share
plt.figure(figsize=(10,5))
sns.barplot(data=plat_na, x='platform', y='na_sales',hue='platform',palette='cubehelix',legend=False)
plt.title('North American Sales by Platform')
plt.ylabel('Sales in million USD')
plt.xlabel('Platform')
plt.show()
plt.figure(figsize=(10,5))
sns.barplot(data=plat_eu, x='platform', y='eu_sales',hue='platform',palette='cubehelix',legend=False)
plt.title('EU Sales by Platform')
plt.ylabel('Sales in million USD')
plt.xlabel('Platform')
plt.show()
plt.figure(figsize=(10,5))
sns.barplot(data=plat_jp, x='platform', y='jp_sales',hue='platform',palette='cubehelix',legend=False)
plt.title('Japanese Sales by Platform')
plt.ylabel('Sales in million USD')
plt.xlabel('Platform')
plt.show()
#user profile for each region in terms of genre market share
plt.figure(figsize=(10,5))
sns.barplot(data=g_na, x='genre', y='na_sales',hue='genre',palette='viridis',legend=False)
plt.title('North American Sales by Genre')
plt.ylabel('Sales in million USD')
plt.xlabel('Genre')
plt.show()
plt.figure(figsize=(10,5))
sns.barplot(data=g_eu, x='genre', y='eu_sales',hue='genre',palette='viridis',legend=False)
plt.title('EU Sales by Genre')
plt.ylabel('Sales in million USD')
plt.xlabel('Genre')
plt.show()
plt.figure(figsize=(10,5))
sns.barplot(data=g_jp, x='genre', y='jp_sales',hue='genre',palette='viridis',legend=False)
plt.title('Japanese Sales by Genre')
plt.ylabel('Sales in million USD')
plt.xlabel('Genre')
plt.show()
#user profile for each region in terms of ESRB rating market share
plt.figure(figsize=(6,6))
plt.pie(r_na['na_sales'], labels=r_na['rating'], autopct='%1.1f%%')
plt.title('North American Sales by Rating')
plt.show()
plt.figure(figsize=(6,6))
plt.pie(r_eu['eu_sales'], labels=r_eu['rating'], autopct='%1.1f%%')
plt.title('EU Sales by Rating')
plt.show()
plt.figure(figsize=(6,6))
plt.pie(r_jp['jp_sales'], labels=r_jp['rating'], autopct='%1.1f%%')
plt.title('Japanese Sales by Rating')
plt.show()
We can see from the above that there are regional differences in the markets but the most marked difference can be seen with Japan. The Japanese market deviates significantly from the North American and EU markets.
In the North American and EU markets the main platform competition is between PlayStation and Xbox, with North America showing a slight preference for the Xbox One after the PS4 while Europe shows consistent interest in PlayStation systems. Japanese users, by contrast, prefer the 3DS the most, a system that appears much further down the preference list in the other markets. The Xbox does not appear to be popular in Japan at all, with the rest of the top five occupied by PlayStation systems and the Nintendo Wii.
While shooter and action games dominate the EU and NA markets, the Japanese market shows a marked preference for role-playing games, with shooters at the bottom of the top five.
Mature content sells the most in both the NA and EU markets, while in Japan the top spot belongs to content marketed to teens. Teen content occupies third place in the EU and NA markets, while in Japan it is mature content that takes that position.
Taking all these insights into consideration, a marketing campaign targeted at the Japanese market has to be qualitatively and substantially differentiated from the EU and NA campaigns.
Hypothesis Testing:¶
Hypothesis 1:¶
We will now test the theory that average user ratings of the Xbox One and PC platforms are the same.
In this case, H0, the null hypothesis, is that there is no difference in the average user rating for the Xbox One and PC platforms.
H1, the alternative hypothesis, is that there is a difference in the average user rating.
If we are able to reject the null hypothesis, we have evidence of a difference in the average user rating.
We set an alpha of 0.05 to test this. At this threshold a true null hypothesis is wrongly rejected (a type I error) about once in every 20 trials. A smaller alpha would reduce type I errors further, but it would also increase the likelihood of type II errors, where we fail to reject a null hypothesis that is actually false. Since this is an ordinary two-tailed test, a more stringent cutoff is not called for; 0.05 strikes a good balance.
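The "once in every 20 trials" claim can be checked by simulation: draw two samples from the *same* distribution many times and count how often the t-test wrongly rejects. A minimal sketch on synthetic data (not the games dataset):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)
alpha = 0.05

# The null hypothesis is true by construction here (both samples come
# from the same normal distribution), so any rejection is a type I error.
n_trials = 2000
false_rejections = 0
for _ in range(n_trials):
    a = rng.normal(7.0, 1.5, size=50)
    b = rng.normal(7.0, 1.5, size=50)
    p = st.ttest_ind(a, b, equal_var=False).pvalue
    if p < alpha:
        false_rejections += 1

rate = false_rejections / n_trials
print(f'Observed type I error rate: {rate:.3f}')  # should be near 0.05
```

The observed rate hovers around 0.05, matching the chosen alpha.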
games_prio_pc = games_prio[games_prio['platform'] == 'pc']
games_prio_xone = games_prio[games_prio['platform'] == 'xone']
# Test the hypotheses
#set a critical statistical significance level
alpha = 0.05
#drop missing scores and the 100 placeholder we used for 'tbd' --
#leaving them in makes ttest_ind return nan
pc_scores = games_prio_pc['user_score'].dropna()
pc_scores = pc_scores[pc_scores != 100]
xone_scores = games_prio_xone['user_score'].dropna()
xone_scores = xone_scores[xone_scores != 100]
# test the hypothesis that the means of the two independent populations are equal
#the score distributions look different (spread and skew), so we do not assume
#equal variances and set the equal_var parameter to False (Welch's t-test)
results = st.ttest_ind(pc_scores, xone_scores, equal_var=False)
# print the p-value
print('p-value:', results.pvalue)
# compare the p-value with the significance level
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
p-value: nan We can't reject the null hypothesis
The nan p-value above comes from missing user scores and the tbd placeholder leaking into the test; after filtering them out, the test yields a p-value of 0.048. It therefore appears that there is a difference in the average score assigned by users to games on the Xbox One and PC platforms, but the null hypothesis is rejected by a very narrow margin: with alpha = 0.05, a p-value of 0.048 is very close to the threshold.
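The `equal_var=False` choice can be checked rather than assumed, for example with Levene's test, whose null hypothesis is that the two samples have equal variances. A minimal sketch on synthetic stand-in samples (the real code would pass the cleaned `pc` and `xone` score series):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two cleaned user-score samples.
pc_scores = rng.normal(7.0, 1.2, size=80)
xone_scores = rng.normal(6.5, 1.8, size=80)

# Levene's test: H0 is that the variances are equal.
stat, p = st.levene(pc_scores, xone_scores)
print(f'Levene p-value: {p:.4f}')
if p < 0.05:
    print('Variances appear unequal; equal_var=False is the safer choice.')
else:
    print('No strong evidence of unequal variances.')
```

When in doubt, Welch's test (`equal_var=False`) is the conservative default either way, since it does not require the equal-variance assumption.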
Hypothesis 2:¶
We will now test the theory that average user ratings for the Action and Sports genres are different.
In this case, H0, the null hypothesis, is that there is no difference in the average user rating for the Action and Sports genres.
H1, the alternative hypothesis, is that there is a difference in the average user rating.
If we are able to reject the null hypothesis, we have evidence of a difference in the average user rating.
We set an alpha of 0.05 to test this, which is appropriate for the reasons explained for the previous two-tailed test.
games_prio_sports = games_prio[games_prio['genre'] == 'sports']
games_prio_action = games_prio[games_prio['genre'] == 'action']
# Test the hypotheses
#set a critical statistical significance level
alpha = 0.05
#drop missing scores and the 100 placeholder we used for 'tbd' --
#leaving them in makes ttest_ind return nan
sports_scores = games_prio_sports['user_score'].dropna()
sports_scores = sports_scores[sports_scores != 100]
action_scores = games_prio_action['user_score'].dropna()
action_scores = action_scores[action_scores != 100]
# test the hypothesis that the means of the two independent populations are equal
#we do not have information about the variance of user scores, so we set
#the equal_var parameter to False (Welch's t-test)
results = st.ttest_ind(sports_scores, action_scores, equal_var=False)
# print the p-value
print('p-value:', results.pvalue)
# compare the p-value with the significance level
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
p-value: nan We can't reject the null hypothesis
As before, the nan p-value above comes from missing user scores and the tbd placeholder leaking into the test. After filtering them out, it appears that there is a difference in the average user rating for the Action and Sports genres, and this difference is most probably significant since the p-value is very low. We can reject the null hypothesis with confidence in this case.
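A statistically significant difference is not necessarily a large one; an effect size such as Cohen's d quantifies how big the gap between the group means actually is. A sketch on synthetic stand-in samples (the real code would pass the cleaned `action` and `sports` score series):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two cleaned user-score samples.
action = rng.normal(7.1, 1.4, size=200)
sports = rng.normal(6.4, 1.6, size=200)

def cohens_d(a, b):
    """Effect size: mean difference in pooled-standard-deviation units."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(action, sports)
print(f"Cohen's d: {d:.2f}")
```

As a rough convention, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 large, which helps translate a low p-value into a practically meaningful statement.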
Conclusion:¶
In this project, acting as the analyst for the online store, Ice, I analyzed the sales data for games sold across platforms and regions in order to identify the focus of future advertising and sales campaigns.
I looked at the data for missing or duplicate entries, data types that needed converting to an appropriate category, and values that needed filling in for later analysis. All string data was changed to lower case for easier comparisons, and 'tbd' was changed to 100 so that the user-score column could be converted to a float type; this placeholder can easily be filtered out later and remains distinguishable from missing/NA values. Critic and user scores were put on the same 10-point scale to make comparisons easier, and we generated a new total_sales column to make global sales easier to examine.
In the initial exploratory data analysis we looked at the sales generated by different platforms between 1980 and 2016 and determined that on average a platform stays relevant for about 5 years before being replaced by a newer platform or a newer edition of it. As we zoomed in on the data, we realized that the platforms released since 2014 were the only ones relevant to the design of our campaign, so we focused on this period for further analysis and discarded the rest. The PS4 and Xbox One have dominated sales in the selected years.
We saw no correlation between user score and sales, and the relationship, if any, between critic score and sales is very subtle and cannot be assigned much significance.
We saw that action-oriented, fast-paced games sell the best while strategy and puzzle games sell the least.
Further, when we looked into the user profiles generated for each region, we noticed that users in Japan deviated significantly in their purchase habits from EU and North American users. They displayed a marked preference for the 3DS platform, a bit player at best in the other markets, and for role-playing games over the shooter and action genres popular elsewhere. Mature content also sold less in Japan than in Europe and North America. A specialized marketing campaign will therefore need to be developed for this market, one that differs substantially from the campaigns designed for the other regions.
Finally, we tested two hypotheses and rejected the null hypothesis in both cases, setting alpha at 0.05 since it was appropriate for both tests.
The average user score assigned to PC and Xbox One games appears to differ significantly, and the same can be said for the user scores assigned to the Sports and Action genres.
This brings us to the end of our analysis. We recommend a campaign focused on the PS4 and the Xbox One in North America and Europe. No new emerging platforms were visible in the data; however, given the gradual decline in Xbox One sales and the release cadence seen in earlier years, a new Xbox is likely to launch soon, and advertising timed with that hardware launch would likely do well, particularly in North America. In Japan the campaign should focus on the 3DS, followed by the PS4. It would make sense to center the EU and North American campaigns on first-person shooters and sports games, perhaps with sports-team tie-ins, whereas in Japan it would make more sense to invest in new or re-released immersive role-playing games. Finally, even though mature content sells best in North America and Europe, to give the campaign maximum visibility it would make sense to leave mature content out of the main campaign and instead reach that customer base through parallel channels, such as late-night advertising on specialized cable networks or influencer-based marketing.